Background

The data presented in this report are part of a study aimed to assess differential gene expression and methylation in vaping versus non-vaping LatinX youths in Pueblo and Denver, CO. Pulmonary function data were also obtained in order to better understand the impacts of vape use on pulmonary function. To assess differential gene expression and methylation, naso-epithelial swabs were obtained from each participating subject. Pulmonary function is assessed using PFTs (Pulmonary Function Tests) and Impulse Oscillometry (IOS).

Study Population

This data set consists of samples taken from 51 people ages 12-17 yrs from the Pueblo, Denver, and Aurora, CO areas. Vape Status (did you vape in the last 6 months?) and ethnicity are self-reported.

Methods

All analyses performed using R version 4.1.2 (2021-11-01)

Clinical Data Pocessing

Vape Status

Subjects are dichotomized to those that used a vaping device in the last 6 months and those who have not based on the variables ever_vape, vape_days, and last_vape. This variable will be referred to as Vape Status throughout this report. One participant (SID = 111) reported that they had used a vaping device 5 out of the last 30 days, but did not respond to last_vape. That participant is recorded as Vaped in this analysis.

Sex

Biological sex has been derived from methylation data utilizing the getSex function from the R package minfi 1.40.0.

Geographic Location

Subjects’ geographic location, city, was grouped into the new broader variable recruiting_center which encompasses the broader geographic region where they live.

Lung Function and IOS

Measures of lung function and IOS were visually inspected for normality using histograms.

Gene-Count Processing

Annotation: Ensembl annotation for GrCH 39 ver. 37
Removal of Unwanted Variance: edgeR 3.36.0 and RUVSeq 1.28.0
Differential Expression Analysis: DESeq2 1.34.0

Gene Filtering Parameters
This analysis will conduct a comparison of various gene-filtering parameters presented in previous analyses and in the current literature to select parameters best-suited for this study.

Normalization
The following analyses used the function RUVr from the R package RUVSeq. RUVr uses the deviance residuals from a first pass negative binomial GLM to perform a factor analysis which corrects for unwanted technical effects. The first-pass model formula is presented below :

\[raw \ read \ count \sim \beta_0 + \beta_1 * vape \ status \ + \beta_2 *male \ + \beta_3 * age\]

RUVr will be performed with k = 1 through k = 2 factors and the best k for factor analysis will be determined visually using an elbow plot, RLE plots, and dendrograms. Previous analyses used R package DESeq2 to fit the first-pass GLM. This analysis will use edgeR due to its reference in the literature for the RUVr procedure mentioned above.

Transformations
A variance stabilizing transformation (VST) was applied in previous analyses. This analysis will use the same transformation, but will apply the transformation only after normalization.

Results

Clinical Data Processing Results

After removing participants with missing values for vape status, we are left with n = 50 subjects. Previous analyses showed n = 12 participants had vaped in the last 6 months. This analysis will use n = 13 participants who had vaped in the last 6 months. The lung function variable fev1 and fev1_fvc reported 22 missing values. IOS measures r5 and x20 reported 1 and 6 missing values, respectively.

Table 1: Clinical Data

#create table 1
vape_6mo_table1 <- tab1_dat %>% 
  tableby(vape_6mo_lab ~ sex_lab + age + recruitment_center + 
            latino_lab + fev1 + fev1_fvc + 
            r5 + x20, data = ., digits = 1, test = FALSE )

#Fix Labels
arsenal::labels(vape_6mo_table1) <- c(vape_6mo_lab = "Vaped in last 6 months", 
                             age = "Age (yrs)", sex_lab = "Sex", 
                             recruitment_center = "Recruitment Center", 
                             latino_lab = "Ethnicity", fev1 = 'FEV1',
                             fev1_fvc = "FEV1/FVC (%)", r5 = 'R5', x20 = 'X20')

#Print Tables
summary(vape_6mo_table1, pfootnote = T)
Did Not Vape in Last 6 Months (N=37) Vaped in Last 6 Months (N=13) Total (N=50)
Sex
   Female 21 (56.8%) 5 (38.5%) 26 (52.0%)
   Male 16 (43.2%) 8 (61.5%) 24 (48.0%)
Age (yrs)
   Mean (SD) 14.6 (1.4) 14.8 (1.4) 14.6 (1.4)
   Range 12.0 - 17.0 13.0 - 17.0 12.0 - 17.0
Recruitment Center
   Aurora 15 (40.5%) 0 (0.0%) 15 (30.0%)
   CommCity/Denver 13 (35.1%) 1 (7.7%) 14 (28.0%)
   Pueblo 9 (24.3%) 12 (92.3%) 21 (42.0%)
Ethnicity
   LatinX 23 (62.2%) 11 (84.6%) 34 (68.0%)
   Non-LatinX 14 (37.8%) 2 (15.4%) 16 (32.0%)
FEV1
   N-Miss 10 12 22
   Mean (SD) 2.6 (0.7) 3.9 (NA) 2.6 (0.7)
   Range 1.2 - 3.9 3.9 - 3.9 1.2 - 3.9
FEV1/FVC (%)
   N-Miss 10 12 22
   Mean (SD) 0.8 (0.1) 0.7 (NA) 0.8 (0.1)
   Range 0.5 - 1.0 0.7 - 0.7 0.5 - 1.0
R5
   N-Miss 1 0 1
   Mean (SD) 4.0 (0.9) 5.0 (1.3) 4.3 (1.1)
   Range 2.0 - 6.1 3.7 - 7.6 2.0 - 7.6
X20
   N-Miss 4 2 6
   Mean (SD) 0.1 (0.6) 0.7 (0.9) 0.2 (0.7)
   Range -1.1 - 2.4 -1.0 - 2.3 -1.1 - 2.4
#Fix sex Bar Plot
sex_plot_trial <- tab1_dat %>% 
  group_by(sex_lab, vape_6mo_lab) %>% 
  summarise(N = n())


#By Recruitment sex (n = 50)
bar_sex <- sex_plot_trial %>% 
  ggplot(aes(x = sex_lab, y = N, fill = vape_6mo_lab)) + 
  geom_bar(stat = 'identity', position = 'dodge') +
  labs(y = "Count", fill = "Vape Status") + 
  ggtitle("Sex") +
  scale_fill_manual(values = brewer.pal(3, "Set2"))+
  theme(axis.title.x=element_blank())

#By Latino(n = 50)
bar_latino <- tab1_dat %>% 
  ggplot(aes(x = latino_lab, fill = vape_6mo_lab))+
  geom_bar(position = "dodge")+
  labs( y = "Count", fill = "Vape Status") +
  ggtitle("LatinX") +
  scale_fill_manual(values = brewer.pal(3, "Set2")) +
  theme(axis.title.x=element_blank())

#Fix Recruitment Center Bar Plot
center_plot_trial <- tab1_dat %>% 
  group_by(recruitment_center, vape_6mo_lab) %>% 
  summarise(N = n())

center_plot_trial[nrow(center_plot_trial) + 1,] <- NA

center_plot_trial[6,] <- list("Aurora", "Vaped in Last 6 Months", 0)

#By Recruitment Center
recruitment_center_hist <- center_plot_trial %>% 
  ggplot(aes(x = recruitment_center, y = N, fill = vape_6mo_lab)) + 
  geom_bar(stat = 'identity', position = 'dodge') +
  labs(y = "Count", fill = "Vape Status") + 
  ggtitle("Center")+
  scale_fill_manual(values = brewer.pal(3, "Set2")) +
  theme(axis.title.x=element_blank())


#By FEV1/FVC Continuous
hist_fev1_fvc <- tab1_dat %>%
  ggplot(aes(x = fev1_fvc, fill = vape_6mo_lab))+
  geom_histogram(binwidth = 0.1, breaks = seq(0.5,1,0.1), position = "dodge")+
  labs(x = TeX("\\frac{FEV1}{FVC}"), y = "Count", fill = "Vape Status:") +
  xlim(0.5,1) + 
  scale_fill_manual(values = brewer.pal(3, "Set2")) +
  ggtitle("FEV1/FVC")

#By FEV1/FVC Box
box_fev1_fvc <- tab1_dat %>%
  ggplot(aes(x = vape_6mo_lab, y = fev1_fvc, fill = vape_6mo_lab)) +
  geom_boxplot(show.legend = F) +
  labs(x = "Vape Status", y = TeX("\\frac{FEV1}{FVC}")) +
  scale_fill_manual(values = brewer.pal(3, "Set2")) +
  ggtitle(TeX("\\frac{FEV1}{FVC} \ by Vape Status (n = 28)"))

#By R5(hist)
r5_hist <- tab1_dat %>%
  ggplot(aes(x = r5, fill = vape_6mo_lab)) +
  geom_histogram(binwidth = 1, breaks = seq(1,8,1), position = 'dodge') +
  labs(x = "R5", y = "Count", fill = "Vape Status") +
  xlim(1,8) +
  scale_fill_manual(values = brewer.pal(3, "Set2")) +
  ggtitle("R5")

#by R5(box)
r5_box <- tab1_dat %>%
  ggplot(aes(x = vape_6mo_lab, y = r5, fill = vape_6mo_lab)) +
  geom_boxplot(show.legend = F) +
  labs(x = "Vape Status", y = "R5") +
  ggtitle("R5 by Vape Status (n = 49)") +
  scale_fill_manual(values = brewer.pal(3, "Set2")) +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

#by X20(box)
x20_box <- tab1_dat %>%
  ggplot(aes(x = vape_6mo_lab, y = x20, fill = vape_6mo_lab)) +
  geom_boxplot(show.legend = F) +
  labs(x = "Vape Status", y = "X20") +
  ggtitle("X20 by Vape Status (n = 44)") +
  scale_fill_manual(values = brewer.pal(3, "Set2")) +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

#by x20(his)
x20_hist <- tab1_dat %>%
  ggplot(aes(x = x20, fill = vape_6mo_lab)) +
  geom_histogram(bins = 10, breaks = seq(-2,3,1), position = "dodge") +
  labs(x = 'X20', y = "Count", fill = 'Vape Status') +
  scale_fill_manual(values = brewer.pal(3, "Set2")) +
  ggtitle("X20")

Figure 1: Population Demographics

Figure 1 visualizes the demographic information presented in Table 1. The majority of vaping subjects were recruited in Pueblo (92%) and identified as LatinX (85%). 62% of subjects were male as confirmed from the available methylation data.

Figure 2: Pulmonary Function

Figure 2 is a visualization of the pulmonary function (\(\frac{FEV1}{FVC}\)) and IOS (R5 and X20) variables. \(\frac{FEV1}{FVC}\) was only completed by n = 22 individuals from the study population. R5 and X20 represent n = 49 and n = 44 individuals, respectively.

#Respiratory Figures
ggarrange(hist_fev1_fvc, ggarrange(r5_box, x20_box, ncol = 2, nrow = 1, legend = "none"), nrow = 2, common.legend = T, legend = "bottom")

Figure 3: Correlation Matrix

Figure 3 identifies correlation patterns in the clinical variables of interest. To test for correlation, a contingency table was made for each comparison of variables. because some contingency tables contained cells with <5 subjects, a Fisher’s Exact test was used for all tests.

From the correlation matrix, it appears that there may be significant correlation between vape status and recruitment center. A separate t-test was run to determine independence between vape status and age.

Test Variable Group 1 Group 2 Mean Group 1 Mean Group 2 t df p-value
Age Did Not Vape in Last 6 Months Vaped in last 6 months 14.583 14.909 -0.654 16.561 0.522

Gene-Count Processing Results

After merging clinical metadata with gene-count data, there were n = 47 samples present. From the clinical metadata, SID = 105, 137, and 103 are missing genetic data. Sample 23 (SID = 144) was also removed due to missing vape status.

Table 2: Comparison of Gene Filtering Parameters

The table below compares gene-filtering parameters from previous analyses to a parameter presented by the creators of the RUVseq package.

source(here("Code/exploratory_report/02_gene_filter_comparison.R"), local = knitr::knit_global())

filter_compar
Filter Incluison Criteria Gene Count Before Gene Count After Genes Removed
1 At least 25% of the samples have > 0 reads 60,651 31,505 29,146
2 The range of reads across all samples < 100 31,505 16,860 14,645
3 >5 reads in at least 2 samples (Bioconductor) 60,651 29,141 31,510

Figure 4: Genes removed for read-count cutoff values

The figure below is used to visually assess how many genes are removed across cutoff values for the range of reads for each gene across the 47 samples after filtering out genes that have 0 read counts in 75% or more of the samples. The red point represents the number of genes (14,645) removed at the cutoff range of 100 (the value used in previous analyses).

After reviewing the comparison of filters, it appears that filtering parameters presented by the creators of RUVSeq may be overly conservative for this application. Analyses will proceed using the same filters as previous analyses (filters 1 & 2 from table 2). The first filter will remove genes with 0 reads in more than 75% of samples, and the second will remove genes with low variation (range of reads < 100) that may be considered “house-keeping” genes. This analysis will include a total of n = 16,860 genes.

Figure 5: Relative Loge Expression and Principal Component Analysis Prior to RUV

The following figure shows the Relative Log Expression (RLE) and Principal Component Analysis (PCA) of read counts for each sample without any prior transformations.

Figure 6: RLE and PCA Excluding Outlier Samples

It appears that Sample 12 (SID = 102) may be an outlier. Below are the RLE and PCA plots with the sample removed.

To further assess the removal of sample 12, the following is an Elbow Plot which displays the reduction in within-cluster sums of squares when increasing the number of clusters (k) both with and without sample 12.

Figure 7: Elbow Plot Comparison of RUVr Factor Inclusion

To visually inspect for the best cutoff for factor analysis, RUVr was run for k = 1 and k = 2 factors both with and without Sample 12 included in the dataset. The RLE and PCA plots for each are presented below.

Figure 8: RLE Comparison of RUVr Factor Inclusion (Sample 12 in dataset)

As was discussed previously, it appears that Sample 12 may be an outlier. Below are the RLE plots after RUVr was conducted with k = 1 and k = 2 with Sample 12 removed from the dataset.

Figure 9: RLE Comparison of RUVr Factor Inclusion (Sample 12 removed from dataset)

To help additionally compare, the following are side-by-side PCA plots comparing samples with and without sample 12 removed before RUVr, after RUVr with k = 1, and after RUVr with k = 2.

Figure 10: PCA Comparison of RUVr Factor Inclusion

The dendrograms for k = 1 and k = 2 are compared below with and without Sample 12 included.

Figure 11: Dendrogram Comparison of RUVr Factor Inclusion (k = 1 and k = 2) (Sample 12 in dataset)

Figure 12: Dendrogram Comparison of RUVr Factor Inclusion (k = 1 and k = 2) (Sample 12 removed)

Using the figures above (with special attention towards Figure 10), It seems appropriate to keep sample 12 and use k = 2 factors for RUVr normalization.

Notes

  • Between this report and it’s previous version (“teen_vape_exploratory_report_2022_04_15”), the variable LatinX was included in the model as opposed to the variable Age. This report corrects that mistake. Visually, it seems to make little differences in the figures used to select K.